Automatic Lexicon Enhancement by Means of Corpus Tagging

Authors

  • Frédéric Béchet
  • Thierry Spriet
  • Marc El-Bèze
Abstract

The aim of this study is to use a specialised text corpus to automatically enhance a general lexicon. Lexicons that offer maximal coverage of a specific topic are an important benefit in many applications of Automatic Speech and Natural Language Processing. Because large corpora of specialised texts are available, the enhancement of these lexicons can be automated. A syntactic tagging process, based on 3-class and 3-gram language models, allows us to automatically assign possible syntactic categories to the Out-Of-Vocabulary (OOV) words found in the processed corpus. These OOV words generally occur several times in the corpus, and the number of these occurrences can be large. By taking into account all the occurrences of an OOV word in a given text as a whole, we propose a method for automatically extracting a specialised lexicon from a text corpus that is representative of a specific topic.
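The idea of pooling every occurrence of an OOV word before deciding on its categories can be illustrated with a minimal sketch. This is not the authors' implementation: the tagger itself (the 3-class and 3-gram models) is assumed to have already run, and the function name `extract_oov_lexicon`, the thresholds `min_count` and `min_share`, and the toy tag set are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def extract_oov_lexicon(tagged_sentences, lexicon, min_count=2, min_share=0.5):
    """Collect candidate lexicon entries for out-of-vocabulary (OOV) words.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs
        produced by some tagger (assumed to exist; not implemented here).
    lexicon: set of words already known to the general lexicon.
    min_count: hypothetical threshold, minimum occurrences of an OOV word.
    min_share: hypothetical threshold, minimum fraction of occurrences
        a tag must cover to be kept for that word.
    """
    # Pool the tags proposed for each OOV word over the whole corpus.
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            if word not in lexicon:
                tag_counts[word][tag] += 1

    new_entries = {}
    for word, counts in tag_counts.items():
        total = sum(counts.values())
        if total < min_count:
            continue  # too few occurrences to trust the tagging
        kept = sorted(t for t, c in counts.items() if c / total >= min_share)
        if kept:
            new_entries[word] = kept
    return new_entries


# Toy data: the tags below are invented, not output of a real tagger.
lexicon = {"the", "patient", "received", "a", "dose", "of", "was", "given"}
tagged = [
    [("the", "DET"), ("patient", "NOUN"), ("received", "VERB"), ("amoxicillin", "NOUN")],
    [("amoxicillin", "NOUN"), ("was", "AUX"), ("given", "VERB")],
    [("a", "DET"), ("dose", "NOUN"), ("of", "ADP"), ("amoxicillin", "ADJ")],
    [("a", "DET"), ("bolus", "NOUN")],
]
result = extract_oov_lexicon(tagged, lexicon)
print(result)
```

In this toy run, a word seen only once is skipped, and a tag proposed for a minority of a word's occurrences is discarded, so isolated tagging errors do not reach the extracted lexicon. How the original method actually weighs occurrences is not specified in the abstract.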

Similar resources

Software Tools for Morphological Tagging of Zulu Corpora and Lexicon Development

The aim of this paper is to discuss aspects of an on-going project on the development of grammatical and lexical resources for Zulu with sufficient coverage for unrestricted text. We explain how the basic software tools of computational morphology are used in linguistic processing, more specifically for automatic word form recognition and morphological tagging of the growing stock of electronic...

CEPLEXicon ― A Lexicon of Child European Portuguese

CEPLEXicon (version 1.1) is a child lexicon resulting from the automatic tagging of two child corpora: the corpus Santos (Santos, 2006; Santos et al. 2014) and the corpus Child – Adult Interaction (Freitas et al. 2012), which integrates information from the corpus Freitas (Freitas, 1997). This lexicon includes spontaneous speech produced by seven children (1;02.00 to 3;11.12) during approximate...

Large-Coverage Root Lexicon Extraction for Hindi

This paper describes a method, using morphological rules and heuristics, for the automatic extraction of large-coverage lexicons of stems and root word-forms from a raw text corpus. We cast the problem of high-coverage lexicon extraction as one of stemming followed by root word-form selection. We examine the use of POS tagging to improve precision and recall of stemming and thereby the coverage ...

Automatic Detection of Collocation

Collocation is an important relation between words that is widely applicable to semantic parsing (e.g., word sense disambiguation), machine translation (e.g., automatic alignment of bilingual corpora), computational lexicons, etc. First, we summarized the theory behind the likelihood interval, likelihood ratio test, u test and χ² test for collocation, and then utilized them to extr...

Development of a Pediatric Text-Corpus for Part-of-Speech Tagging

Most efforts in natural language processing (NLP) have been devoted to understanding general domain data. Special domains, such as pediatric medicine, pose some unique problems and challenges. While many common sense corpora and lexicons have been created, we know of none directly related to pediatric medicine. This article presents the status of an ongoing project to create a large corpus and ...

Publication date: 1997